I used to buy red wine from the liquor store a few years ago and enjoy it when I was alone. I usually bought red wine made in France because I prefer the taste of the wine made there. But I do not know which features determine the quality of red wine. With data science techniques we can analyze which features are associated with the best quality of red wine.
This report is about red wine quality (https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt). We explore a dataset of 1599 red wines with 12 attributes: 11 chemical variables and a quality score. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). The attributes of red wine are as follows:
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Output attribute (based on sensory data): 12 - quality (score between 0 and 10)
## [1] 1599
## [1] 13
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
Our dataset consists of 13 columns and 1599 observations: the row index X, the 11 chemical attributes, and the quality score.
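As a hedged sketch of how the output above could be produced in R, the block below rebuilds only the six rows printed by head() (typed in by hand, so no file name or path is assumed) and runs the same kind of inspection calls on them. The data frame name `redwine` follows the lm() calls later in this report; `features` is a hypothetical helper name.

```r
# First six rows as printed by head() in this report, typed in by hand
# so the sketch runs without the CSV file.
redwine <- data.frame(
  X = 1:6,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8),
  chlorides = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075),
  free.sulfur.dioxide = c(11, 25, 15, 17, 11, 13),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40),
  density = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978),
  pH = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L)
)

nrow(redwine)   # number of observations (1599 for the full file)
ncol(redwine)   # number of columns, including the row index X
str(redwine)    # type of each column
names(redwine)  # column names

# X only numbers the rows and quality is the output, so neither
# counts as a chemical attribute.
features <- setdiff(names(redwine), c("X", "quality"))
length(features)
```

Dropping X and quality by name leaves the 11 chemical attributes, which is why the report speaks of 11 features even though the data frame has 13 columns.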
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
All quality scores lie between 3 and 8. No wine scored 0, 1, 2, 9, or 10, so the scores cover only a narrow part of the 0 to 10 scale.
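The summary and frequency table above can be cross-checked without the data file. The sketch below rebuilds the quality column from the printed counts (scores 3 to 8) and recomputes the summary; `share_5_6` is a hypothetical name for the fraction of wines rated 5 or 6.

```r
# Rebuild the quality column from the frequency table printed above
# (scores 3 to 8 with their counts); no data file is needed.
scores <- 3:8
counts <- c(10, 53, 681, 638, 199, 18)
quality <- rep(scores, times = counts)

length(quality)   # total number of wines
summary(quality)  # min, quartiles, mean, max
table(quality)    # same counts as the printed table

# Fraction of wines rated 5 or 6
share_5_6 <- mean(quality %in% c(5, 6))
round(share_5_6, 3)
```

The two middle scores account for most of the wines, which is what makes the score range look so narrow.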
Next we look at the distribution of each numeric attribute, such as fixed.acidity, volatile.acidity, and so on.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The lowest fixed acidity is 4.6 and the highest is 15.9. Here I plot the main body of the fixed acidity values. The distribution appears positively skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The mean of volatile acidity is 0.5278 and the median is 0.52. The distribution of volatile acidity appears positively skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 132 33 50 30 29 20 24 22 33 30 35 15 27 18 21
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 19 9 16 22 21 25 33 27 25 51 27 38 20 19 21
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 30 30 32 25 24 13 20 19 14 28 29 16 29 15 23
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 22 19 18 23 68 20 13 17 14 13 12 8 9 9 8
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 9 2 1 10 9 7 14 2 11 4 2 1 1 3 4
## 0.75 0.76 0.78 0.79 1
## 1 3 1 1 1
The shape of the citric acid distribution is not clear, so I use a log scale on the x axis. A large number of wines have 0 citric acid (132 wines), and there is a second peak at 0.49 citric acid (68 wines).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The mean of residual sugar is 2.539 and the median is 2.2. The distribution appears positively skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The mean of chlorides is 0.08747 and the median is 0.079. The distribution appears positively skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
The mean of free sulfur dioxide is 15.87 and the median is 14. The minimum is 1 and the maximum is 72. The distribution appears positively skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The mean of total sulfur dioxide is 46.47 and the median is 38. The minimum is 6 and the maximum is 289. The distribution appears positively skewed.
The distributions of free sulfur dioxide and total sulfur dioxide have similar shapes, which suggests they are correlated. We are not sure whether free sulfur dioxide is a fixed proportion of total sulfur dioxide, so we plot the ratio free.sulfur.dioxide/total.sulfur.dioxide.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02273 0.25926 0.37500 0.38231 0.48485 0.85714
The sulfur dioxide ratio, which is free sulfur dioxide divided by total sulfur dioxide, is not constant. The minimum ratio is 0.0227 and the maximum is 0.857. The distribution of the ratio looks approximately normal. So we cannot yet describe the relation between the amount of free sulfur dioxide and the amount of total sulfur dioxide.
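A minimal sketch of how such a ratio can be computed, using only the six wines printed by head() earlier in this report. The variable name `so2.ratio` is an assumption for illustration, not necessarily the name used in the original analysis.

```r
# Free and total sulfur dioxide for the first six wines shown by head().
free.sulfur.dioxide  <- c(11, 25, 15, 17, 11, 13)
total.sulfur.dioxide <- c(34, 67, 54, 60, 34, 40)

# Ratio of free to total SO2. Free SO2 is part of the total,
# so the ratio should fall between 0 and 1.
so2.ratio <- free.sulfur.dioxide / total.sulfur.dioxide

round(so2.ratio, 3)
range(so2.ratio)
```

Even in these six rows the ratio varies from wine to wine, in line with the finding that free sulfur dioxide is not a constant share of the total.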
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
The mean of density is 0.9967 and the median is 0.9968. The minimum is 0.9901 and the maximum is 1.0037. The distribution looks approximately normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The mean of pH is 3.311 and the median is 3.310. The minimum is 2.74 and the maximum is 4.01. The distribution looks approximately normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The sulphates distribution appears positively skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The mean of alcohol is 10.42 and the median is 10.20. The minimum is 8.4 and the maximum is 14.9. The shape of the distribution is not clear, with or without log scaling. Judging from these two diagrams alone, alcohol did not appear to be related to the quality of red wine; the correlation analysis later in this report checks this directly.
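The skewness judgments made for the attributes above can be roughly cross-checked from the printed summaries alone. The sketch below copies the mean, median, and quartiles of each attribute from this report and computes the gap between mean and median scaled by the interquartile range. This `gap` value is an informal heuristic chosen for illustration, not a standard skewness statistic; a clearly positive value points to a long right tail.

```r
# Mean, median, and quartiles of each attribute, copied from the
# summary() output printed in this report.
stats <- data.frame(
  attribute = c("fixed.acidity", "volatile.acidity", "citric.acid",
                "residual.sugar", "chlorides", "free.sulfur.dioxide",
                "total.sulfur.dioxide", "density", "pH", "sulphates",
                "alcohol"),
  mean   = c(8.32, 0.5278, 0.271, 2.539, 0.08747, 15.87,
             46.47, 0.9967, 3.311, 0.6581, 10.42),
  median = c(7.90, 0.52, 0.26, 2.2, 0.079, 14,
             38, 0.9968, 3.31, 0.62, 10.20),
  q1     = c(7.10, 0.39, 0.09, 1.9, 0.07, 7,
             22, 0.9956, 3.21, 0.55, 9.5),
  q3     = c(9.20, 0.64, 0.42, 2.6, 0.09, 21,
             62, 0.9978, 3.40, 0.73, 11.1)
)

# Informal indicator: (mean - median) / IQR. Values near 0 suggest a
# roughly symmetric distribution, larger positive values a right tail.
stats$gap <- (stats$mean - stats$median) / (stats$q3 - stats$q1)

# Attributes ordered from most to least right-skewed by this indicator.
stats[order(-stats$gap), c("attribute", "gap")]
```

By this rough measure residual sugar and chlorides show the largest gaps, while density and pH sit close to zero, which agrees with their approximately normal shapes.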
There are 1599 red wines in the dataset with 11 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol) and one output, quality. All features are of type num; quality is of type int. We also have the following observations:
The main feature of interest in the red wine dataset is quality. I would like to find how the other features, such as fixed acidity and alcohol, correlate with quality, and which of them are best for predicting the quality of a red wine.
Many features may contribute to the quality of red wines: fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates. We are not sure at this stage which feature contributes most. Judging only from the histograms, features such as alcohol and citric acid may not contribute to the quality of red wine.
We create the sulfur dioxide ratio, which is free sulfur dioxide divided by total sulfur dioxide. The two distributions have similar shapes, so we are interested in whether the two attributes are closely related. But the ratio looks approximately normally distributed with mean 0.382, so it is hard to draw any conclusion about the relation between free sulfur dioxide and total sulfur dioxide.
I log-transformed citric.acid and alcohol because the shapes of these two distributions are not clear. Even after the log transform, I still cannot see a clear distributional form for citric.acid or alcohol. Features such as citric.acid and alcohol may not be strongly correlated with the quality of red wines.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00 -0.26 0.67
## volatile.acidity -0.26 1.00 -0.55
## citric.acid 0.67 -0.55 1.00
## residual.sugar 0.11 0.00 0.14
## chlorides 0.09 0.06 0.20
## free.sulfur.dioxide -0.15 -0.01 -0.06
## total.sulfur.dioxide -0.11 0.08 0.04
## density 0.67 0.02 0.36
## pH -0.68 0.23 -0.54
## sulphates 0.18 -0.26 0.31
## alcohol -0.06 -0.20 0.11
## quality 0.12 -0.39 0.23
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.11 0.09 -0.15
## volatile.acidity 0.00 0.06 -0.01
## citric.acid 0.14 0.20 -0.06
## residual.sugar 1.00 0.06 0.19
## chlorides 0.06 1.00 0.01
## free.sulfur.dioxide 0.19 0.01 1.00
## total.sulfur.dioxide 0.20 0.05 0.67
## density 0.36 0.20 -0.02
## pH -0.09 -0.27 0.07
## sulphates 0.01 0.37 0.05
## alcohol 0.04 -0.22 -0.07
## quality 0.01 -0.13 -0.05
## total.sulfur.dioxide density pH sulphates alcohol
## fixed.acidity -0.11 0.67 -0.68 0.18 -0.06
## volatile.acidity 0.08 0.02 0.23 -0.26 -0.20
## citric.acid 0.04 0.36 -0.54 0.31 0.11
## residual.sugar 0.20 0.36 -0.09 0.01 0.04
## chlorides 0.05 0.20 -0.27 0.37 -0.22
## free.sulfur.dioxide 0.67 -0.02 0.07 0.05 -0.07
## total.sulfur.dioxide 1.00 0.07 -0.07 0.04 -0.21
## density 0.07 1.00 -0.34 0.15 -0.50
## pH -0.07 -0.34 1.00 -0.20 0.21
## sulphates 0.04 0.15 -0.20 1.00 0.09
## alcohol -0.21 -0.50 0.21 0.09 1.00
## quality -0.19 -0.17 -0.06 0.25 0.48
## quality
## fixed.acidity 0.12
## volatile.acidity -0.39
## citric.acid 0.23
## residual.sugar 0.01
## chlorides -0.13
## free.sulfur.dioxide -0.05
## total.sulfur.dioxide -0.19
## density -0.17
## pH -0.06
## sulphates 0.25
## alcohol 0.48
## quality 1.00
Most attributes are not highly correlated with each other, and in particular quality is not highly correlated with any other attribute. Some pairs are correlated, for instance volatile.acidity and citric.acid, or fixed.acidity and pH. We analyze the relation between quality and the other attributes, starting with the two variables most correlated with quality: alcohol and volatile.acidity.
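The choice of attributes can be reproduced from the quality column of the correlation matrix above. The sketch below copies those eleven correlations by hand and ranks them by absolute value; `cor.quality` and `ranked` are hypothetical names.

```r
# Correlation of each attribute with quality, copied from the last
# column of the correlation matrix above.
cor.quality <- c(
  fixed.acidity = 0.12, volatile.acidity = -0.39, citric.acid = 0.23,
  residual.sugar = 0.01, chlorides = -0.13, free.sulfur.dioxide = -0.05,
  total.sulfur.dioxide = -0.19, density = -0.17, pH = -0.06,
  sulphates = 0.25, alcohol = 0.48
)

# Sort by absolute value so strong negative correlations rank high too.
ranked <- cor.quality[order(abs(cor.quality), decreasing = TRUE)]

ranked
names(ranked)[1:8]  # the eight attributes most correlated with quality
```

Ranking by absolute value matters here because volatile.acidity is the second strongest correlate of quality even though its correlation is negative.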
Quality tends to increase as alcohol increases and as volatile.acidity decreases. But the correlation between quality and either alcohol or volatile.acidity is fairly weak.
So we explore the next six attributes most correlated with quality (sulphates, citric.acid, and so on) to check whether any of them has a strong correlation with quality.
None of these attributes is as strongly correlated with quality as alcohol and volatile.acidity.
Then we use linear regression to look for a linear relation between quality and each of these attributes, such as alcohol and volatile.acidity.
##
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = redwine)
## m2: lm(formula = I(quality) ~ I(volatile.acidity), data = redwine)
## m3: lm(formula = I(quality) ~ I(sulphates), data = redwine)
## m4: lm(formula = I(quality) ~ I(citric.acid), data = redwine)
## m5: lm(formula = I(quality) ~ I(fixed.acidity), data = redwine)
## m6: lm(formula = I(quality) ~ I(chlorides), data = redwine)
## m7: lm(formula = I(quality) ~ I(total.sulfur.dioxide), data = redwine)
## m8: lm(formula = I(quality) ~ I(density), data = redwine)
##
## ====================================================================================================================
## m1 m2 m3 m4 m5 m6 m7 m8
## --------------------------------------------------------------------------------------------------------------------
## (Intercept) 1.875*** 6.566*** 4.848*** 5.382*** 5.157*** 5.829*** 5.847*** 80.239***
## (0.175) (0.058) (0.078) (0.034) (0.098) (0.042) (0.034) (10.508)
## I(alcohol) 0.361***
## (0.017)
## I(volatile.acidity) -1.761***
## (0.104)
## I(sulphates) 1.198***
## (0.115)
## I(citric.acid) 0.938***
## (0.101)
## I(fixed.acidity) 0.058***
## (0.012)
## I(chlorides) -2.212***
## (0.426)
## I(total.sulfur.dioxide) -0.005***
## (0.001)
## I(density) -74.846***
## (10.542)
## --------------------------------------------------------------------------------------------------------------------
## R-squared 0.2 0.2 0.1 0.1 0.0 0.0 0.0 0.0
## adj. R-squared 0.2 0.2 0.1 0.1 0.0 0.0 0.0 0.0
## sigma 0.7 0.7 0.8 0.8 0.8 0.8 0.8 0.8
## F 468.3 287.4 107.7 86.3 25.0 27.0 56.7 50.4
## p 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## Log-likelihood -1721.1 -1794.3 -1874.4 -1884.6 -1914.2 -1913.2 -1898.8 -1901.8
## Deviance 805.9 883.2 976.3 988.8 1026.1 1024.8 1006.5 1010.3
## AIC 3448.1 3594.6 3754.9 3775.2 3834.5 3832.5 3803.5 3809.6
## BIC 3464.2 3610.8 3771.0 3791.3 3850.6 3848.6 3819.7 3825.7
## N 1599 1599 1599 1599 1599 1599 1599 1599
## ====================================================================================================================
Each of these attributes is only weakly correlated with quality. A single linear term does not describe the relation well, and the relation may be non-linear. Based on the R^2 values, alcohol and volatile.acidity make the largest linear contributions to the quality score, but each explains at most around 20 percent of the variance in quality.
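For a simple linear regression with an intercept, R^2 equals the square of the correlation between the predictor and the response. The sketch below applies this to the correlations with quality printed earlier (rounded to two decimals there, so the results are approximate) and compares them with the one-decimal R^2 row of the regression table; `r` and `r.squared` are hypothetical names.

```r
# Correlations with quality for the eight predictors of models m1..m8,
# copied from the correlation matrix earlier in this report.
r <- c(alcohol = 0.48, volatile.acidity = -0.39, sulphates = 0.25,
       citric.acid = 0.23, fixed.acidity = 0.12, chlorides = -0.13,
       total.sulfur.dioxide = -0.19, density = -0.17)

# In a one-predictor regression, R^2 is the squared correlation.
r.squared <- r^2

round(r.squared, 2)  # finer view than the regression table gives
round(r.squared, 1)  # same precision as the table's R-squared row
```

At two decimals the gap between alcohol and volatile.acidity becomes visible, which the one-decimal table hides by showing both as 0.2.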
Next we check these eight attributes and see their variation with quality.
For the first four attributes (alcohol, volatile.acidity, sulphates, and citric.acid), quality clearly rises or falls as the attribute increases. For volatile.acidity, the spread also seems to shrink as quality increases. For the other attributes, the change with quality is not obvious.
We explore other highly correlated attributes. First we analyze pH and fixed.acidity.
fixed.acidity and pH are strongly correlated: fixed.acidity decreases as pH increases. This makes sense, because the more acid there is in the wine, the lower the pH, and vice versa.
Next, we explore the other most strongly correlated pairs, such as density and fixed.acidity.
These diagrams show four other closely correlated pairs of attributes in red wine. From these plots we found that:
Fixed acidity tends to increase with the density of red wine. This suggests that fixed acidity is a regular ingredient of red wine, so that wines with more dissolved ingredients also have more fixed acidity.
Fixed acidity tends to increase with citric acid. Maybe these two ingredients are added to the red wine together.
Free sulfur dioxide tends to increase with total sulfur dioxide. Free sulfur dioxide is a component of total sulfur dioxide, so an increase in free sulfur dioxide raises the total.
Citric acid tends to decrease as volatile acidity increases. It seems the two rarely occur at high levels together in red wine: when one is high, the other tends to be low.
The quality of red wine is not strongly correlated with any single variable. We first test the two attributes most correlated with quality, alcohol and volatile acidity, and find that neither is highly correlated with quality. The relationship between quality and alcohol or volatile.acidity may be non-linear. Based on the R^2 values, alcohol or volatile.acidity alone explains at most about 20 percent of the variance in quality.
Then we test the other six features that still show some correlation with quality, but none of them is strongly correlated with it. Based on the R^2 values, sulphates explains the most among these other features, at most about 10 percent of the variance in quality. None of the red wine attributes has a strong linear relation with quality, so we cannot predict the quality of red wine well from any single feature. In the next section we therefore explore combinations of these attributes to look for a linear relation with quality.
The spread of these attributes does not change much with quality either, except that the spread of volatile acidity shrinks as quality increases.
fixed.acidity and pH are closely correlated with each other: fixed.acidity decreases as pH increases. This makes sense, because the more acid there is in the wine, the lower the pH, and vice versa. There are also four other closely correlated pairs of attributes in red wine. From these plots we found that:
Fixed acidity tends to increase with the density of red wine. This suggests that fixed acidity is a regular ingredient of red wine, so that wines with more dissolved ingredients also have more fixed acidity.
Fixed acidity tends to increase with citric acid. Maybe these two ingredients are added to the red wine together.
Free sulfur dioxide tends to increase with total sulfur dioxide. Free sulfur dioxide is a component of total sulfur dioxide, so an increase in free sulfur dioxide raises the total.
Citric acid tends to decrease as volatile acidity increases. It seems the two rarely occur at high levels together in red wine: when one is high, the other tends to be low.
The strongest relation is between fixed.acidity and pH, with a correlation of about -0.68. fixed.acidity is also closely correlated with density (0.67), slightly less strongly than with pH.
We first explore the density plots of the two attributes most correlated with quality, alcohol and volatile acidity, for different quality values.
From the density plots, better-quality red wines occur more often at high alcohol values, and worse-quality red wines occur more often at high volatile acidity values.
In the last section, we found that alcohol has the strongest correlation with the quality of wine. So here we explore which attributes, together with alcohol, go with a better quality of red wine. We select three attributes (volatile.acidity, pH, and free.sulfur.dioxide) and check whether combining each of them with alcohol identifies better-quality red wine.
The first diagram indicates that better-quality red wines tend to have low pH and high alcohol.
The second diagram indicates that better-quality red wines tend to have low volatile acidity and high alcohol.
The last diagram indicates that better-quality red wines tend to have low free sulfur dioxide and high alcohol. All three diagrams show that a combination of two attributes, one of which is alcohol, relates to better-quality red wine.
We also explore other correlated attributes and their contributions to the quality of red wine.
From these diagrams, we found that the quality of red wine is related to a few groups of attributes.
At a fixed density, higher fixed acidity goes with better quality of red wine.
When fixed acidity and citric acid both increase, the quality of red wine also increases.
In the plot, quality increases when free sulfur dioxide and total sulfur dioxide increase, although both attributes have a slightly negative overall correlation with quality (-0.05 and -0.19). Since free sulfur dioxide and total sulfur dioxide are closely correlated, a change in either one is accompanied by a change in quality.
Lower volatile acidity together with higher citric acid goes with better quality of red wine.
Finally, we explore the linear relation between quality and the 8 attributes most correlated with it.
##
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = redwine)
## m2: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity, data = redwine)
## m3: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates,
## data = redwine)
## m4: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates +
## citric.acid, data = redwine)
## m5: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates +
## citric.acid + fixed.acidity, data = redwine)
## m6: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates +
## citric.acid + fixed.acidity + chlorides, data = redwine)
## m7: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates +
## citric.acid + fixed.acidity + chlorides + total.sulfur.dioxide,
## data = redwine)
## m8: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates +
## citric.acid + fixed.acidity + chlorides + total.sulfur.dioxide +
## density, data = redwine)
##
## =================================================================================================================
## m1 m2 m3 m4 m5 m6 m7 m8
## -----------------------------------------------------------------------------------------------------------------
## (Intercept) 1.875*** 3.095*** 2.611*** 2.646*** 2.202*** 2.363*** 2.652*** 28.165
## (0.175) (0.184) (0.196) (0.201) (0.224) (0.228) (0.240) (15.083)
## I(alcohol) 0.361*** 0.314*** 0.309*** 0.309*** 0.320*** 0.304*** 0.288*** 0.268***
## (0.017) (0.016) (0.016) (0.016) (0.016) (0.017) (0.017) (0.021)
## volatile.acidity -1.384*** -1.221*** -1.265*** -1.343*** -1.239*** -1.173*** -1.137***
## (0.095) (0.097) (0.113) (0.113) (0.117) (0.118) (0.120)
## sulphates 0.679*** 0.696*** 0.701*** 0.851*** 0.888*** 0.916***
## (0.101) (0.103) (0.103) (0.111) (0.111) (0.112)
## citric.acid -0.079 -0.469*** -0.335* -0.203 -0.198
## (0.104) (0.137) (0.141) (0.145) (0.145)
## fixed.acidity 0.057*** 0.050*** 0.037** 0.055**
## (0.013) (0.013) (0.014) (0.017)
## chlorides -1.430*** -1.576*** -1.584***
## (0.408) (0.408) (0.408)
## total.sulfur.dioxide -0.002*** -0.002***
## (0.001) (0.001)
## density -25.583
## (15.122)
## -----------------------------------------------------------------------------------------------------------------
## R-squared 0.2 0.3 0.3 0.3 0.3 0.3 0.4 0.4
## adj. R-squared 0.2 0.3 0.3 0.3 0.3 0.3 0.4 0.4
## sigma 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.6
## F 468.3 370.4 268.9 201.8 167.0 142.2 124.9 109.8
## p 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## Log-likelihood -1721.1 -1621.8 -1599.4 -1599.1 -1589.6 -1583.5 -1576.5 -1575.1
## Deviance 805.9 711.8 692.1 691.9 683.7 678.5 672.6 671.4
## AIC 3448.1 3251.6 3208.8 3210.2 3193.3 3183.0 3171.1 3170.2
## BIC 3464.2 3273.1 3235.7 3242.4 3230.9 3226.0 3219.5 3224.0
## N 1599 1599 1599 1599 1599 1599 1599 1599
## =================================================================================================================
Even though we include more correlated variables (8 in total), the R^2 value shows that all 8 variables together explain only about 40 percent of the variance in red wine quality. So the linear relationship between the combination of attributes and quality is still weak, even after adding more attributes.
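Because R^2 is printed with only one decimal, the AIC and BIC rows of the table give a finer comparison of the eight nested models (lower is better for both). The sketch below copies those two rows by hand and reports which model each criterion prefers; the vector names are hypothetical.

```r
# AIC and BIC of models m1..m8, copied from the regression table above.
models <- paste0("m", 1:8)
aic <- c(3448.1, 3251.6, 3208.8, 3210.2, 3193.3, 3183.0, 3171.1, 3170.2)
bic <- c(3464.2, 3273.1, 3235.7, 3242.4, 3230.9, 3226.0, 3219.5, 3224.0)
names(aic) <- models
names(bic) <- models

# Model preferred by each criterion (smallest value).
names(which.min(aic))
names(which.min(bic))

# Change in AIC from each model to the next; a positive value means
# the added variable did not pay for its extra parameter.
diff(aic)
```

On these numbers AIC favors m8 by a small margin, while BIC favors m7, which fits the table: the density coefficient added in m8 carries no significance stars. AIC also rises from m3 to m4, where citric.acid enters with a coefficient close to zero.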
We find a few correlated features that strengthen each other.
Fixed acidity increases with density, and at a fixed density, higher fixed acidity goes with better quality of red wine.
Citric acid increases with fixed acidity, and wines in which both are high tend to have better quality.
Total sulfur dioxide increases with free sulfur dioxide, and in the plot wines in which both are high tend to have better quality, although both attributes have a slightly negative overall correlation with quality (-0.05 and -0.19). Since free sulfur dioxide and total sulfur dioxide are closely correlated, a change in either one is accompanied by a change in quality.
Citric acid tends to be higher when volatile acidity is lower, and low volatile acidity together with high citric acid goes with better quality of red wine.
We select the two attributes most correlated with the quality score: alcohol and volatile acidity. Together the two attributes explain about 30 percent of the variance in quality. The other attributes do not contribute much to the quality. We also find that a better red wine is likely to have high alcohol and low volatile acidity. This rule may help wine producers make good-quality red wine. It may also help me pick good-quality red wine at the liquor store by checking these two values.
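To give a feel for how much these two attributes move the fitted score, the sketch below evaluates model m2 with the coefficients reported in the regression table above. The two wines are hypothetical: their alcohol and volatile acidity values are the first and third quartiles from the summaries earlier in this report. `m2_score` is a helper name introduced here for illustration.

```r
# Coefficients of model m2 as reported in the regression table above.
b0        <- 3.095   # intercept
b.alcohol <- 0.314   # slope for alcohol
b.va      <- -1.384  # slope for volatile.acidity

# Fitted quality score of m2 for given attribute values.
m2_score <- function(alcohol, volatile.acidity) {
  b0 + b.alcohol * alcohol + b.va * volatile.acidity
}

# Hypothetical wines built from the quartiles in the summaries above:
# wine.a has high alcohol and low volatile acidity, wine.b the reverse.
wine.a <- m2_score(alcohol = 11.1, volatile.acidity = 0.39)
wine.b <- m2_score(alcohol = 9.5,  volatile.acidity = 0.64)

round(c(wine.a = wine.a, wine.b = wine.b, difference = wine.a - wine.b), 2)
```

Moving both attributes from one quartile to the other shifts the fitted score by less than one point, which is consistent with a model that explains only about 30 percent of the variance.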
This diagram shows that volatile acidity has one of the clearest relations with the quality of red wine in this dataset: wines with lower volatile acidity tend to have better quality.
This diagram shows that alcohol has the clearest relation with the quality of red wine in this dataset: wines with higher alcohol tend to have better quality.
Based on the correlation tests, the linear regressions, and the previous two plots of alcohol vs quality and volatile acidity vs quality, we found that alcohol and volatile acidity are the two attributes most correlated with the quality of red wine. Here we plot the two attributes together to see how red wine quality relates to their combination. We found that higher alcohol and lower volatile acidity go with better quality of red wine.
In this report, we explore the quality of red wine and its relationship with the other features of red wine. Our analysis has three sections.
In the first section, we explore each individual attribute of red wines. We use histograms to count the occurrences of each value of each attribute. We also report the maximum, minimum, mean, and median of each attribute.
In the second section, we explore the correlations between pairs of attributes. We are especially interested in the relationship between the quality of red wine and the other 11 attributes. We hope to find whether any one of the 11 attributes is correlated strongly enough to predict the quality of red wine. We also explore the relationships between other closely related attributes.
In the third section, we carry out a multivariate analysis of a few attributes. In particular, we explore combinations of a few attributes and their correlation with the quality score. We also explore a few pairs of closely related features and how their values relate to the quality score.
We find a few interesting results about red wine quality and the ingredient attributes.
Alcohol and volatile acidity together are the attributes most closely correlated with red wine quality. Based on the R^2 value, the combination of alcohol and volatile acidity explains about 30 percent of the variance in quality. Higher alcohol and lower volatile acidity are associated with better quality of red wine.
The other features of red wine are hard to relate to the quality score. The combination of the top 8 attributes, including alcohol and volatile acidity, explains only about 40 percent of the variance in quality.
Some attributes are correlated with each other. Free sulfur dioxide and total sulfur dioxide are positively correlated: as free sulfur dioxide increases, total sulfur dioxide increases. pH and fixed acidity are negatively correlated: as fixed acidity decreases, pH increases.
Most quality scores are 5 or 6. There are no high scores such as 9 and 10 and no low scores such as 0, 1, and 2.
We found the two features most correlated with red wine quality: alcohol and volatile acidity. These correlations give us some hints on how to pick good-quality red wine in the market.
One of the difficulties we had when analyzing the data was our unfamiliarity with the measurements of red wine. A few of the chemical terms are unfamiliar to people who are not chemistry specialists. Though we found correlations between a few attributes, we cannot explain the real reason for the correlation between two attributes.
The attribute we are interested in, the quality score, is not very strongly correlated with other attributes such as alcohol. We have to combine a few attributes to find a possible relation between red wine quality and these attributes.
There are a few suggestions for improving the analysis of red wine quality in the future. First, the data set contains only 1599 records of red wine quality tests. We could include more red wine tests to support a better analysis. Second, the range of the quality scores is very limited. Most scores are 5, 6, or 7, and there are no scores of 0, 1, 2, 9, or 10. With so few score levels in use, it is hard to distinguish the quality of red wines. We could use a score range of 0 to 100, or allow fractional quality scores. With more records in the data set and a wider range of quality scores, we could get a better analysis of the quality of red wine.